Abstract

Simulated tempering (ST) is an established Markov chain Monte Carlo (MCMC)
method for sampling from a multimodal density $\pi(\theta)$. Typically, ST involves
introducing an auxiliary variable $k$ taking values in a finite subset of $[0, 1]$ and
indexing a set of tempered distributions, say $\pi_k(\theta) \propto \pi(\theta)^k$. In this case, small
values of $k$ encourage better mixing, but samples from $\pi$ are only obtained when the
joint chain for $(\theta, k)$ reaches $k = 1$. However, the entire chain can be used to estimate
expectations under $\pi$ of functions of interest, provided that importance sampling (IS)
weights are calculated. Unfortunately this method, which we call importance tempering
(IT), can disappoint. This is partly because the most immediately obvious implementation
is naive and can lead to high variance estimators. We derive a new optimal method for
combining multiple IS estimators and prove that the resulting estimator has a highly
desirable property related to the notion of effective sample size. We briefly report on
the success of the optimal combination in two modelling scenarios requiring
reversible-jump MCMC, where the naive approach fails.



Key words: simulated tempering, importance sampling, Markov chain Monte Carlo 
(MCMC), Metropolis-coupled MCMC 
1 Introduction



Markov chain Monte Carlo (MCMC) algorithms, in particular Metropolis-Hastings (MH) 
and Gibbs Sampling (GS), are by now the most widely used methods for simulation-based 
inference in Bayesian statistics. The beauty of MCMC is its simplicity. Very little user input 
or expertise is required in order to establish a Markov chain whose stationary distribution 
is proportional to $\pi(\theta)$, for $\theta \in \Theta \subseteq \mathbb{R}^d$. As long as the chain is irreducible, the theory of
Markov chains guarantees that sample averages computed from this realisation will converge
in an appropriate sense to their expectations under $\pi$. However, difficulties can arise when
$\pi$ has isolated modes, between which the Markov chain moves only rarely. In such cases
convergence is slow, meaning that often infeasibly large sample sizes are needed to obtain 
accurate estimates. 

New MCMC algorithms have been proposed to improve mixing. Two related algorithms
are Metropolis-coupled MCMC (MC$^3$) (Geyer, 1991; Hukushima and Nemoto, 1996) and
simulated tempering (ST) (Marinari and Parisi, 1992; Geyer and Thompson, 1995). Both
are closely related to the optimisation technique of simulated annealing (SA) (Kirkpatrick et al.,
1983). SA works with a set of tempered distributions $\pi_k(\theta)$ indexed by an inverse-temperature
parameter $k \in [0, \infty)$. One popular form of tempering is called "powering up", where
$\pi_k(\theta) \propto \pi(\theta)^k$. Small values of $k$ have the effect of flattening/widening the peaks and raising
troughs in $\pi_k$ relative to $\pi$.

In MC$^3$ and ST we define a temperature ladder $1 = k_1 > k_2 > \cdots > k_m > 0$, and call the $k_i$
its rungs. Both MC$^3$ and ST involve simulating from the set of $m$ tempered densities
$\pi_{k_1}, \ldots, \pi_{k_m}$. MC$^3$ runs $m$ parallel MCMC chains, one at each temperature, and regularly
proposes swaps of states at adjacent rungs $k_i$ and $k_{i+1}$. Usually, samples are only saved from
the "cold distribution" $\pi_{k_1}$. In contrast, ST works with a "pseudo-prior" $p(k_i)$ and uses a
single chain to sample from the joint distribution, which is proportional to $\pi_k(\theta)p(k)$. Again,
it is only at iterations $t$ for which $k^{(t)} = 1$ that the corresponding realisation of $\theta^{(t)}$ is
retained. ST has an advantage over MC$^3$ in that only one copy of the process
$\{\theta^{(t)} : t = 1, \ldots, T\}$ is needed, rather than $m$, so the chain uses less storage and also has
better mixing (Geyer, 1991). The disadvantage is that it needs a good choice of pseudo-prior.
For further comparison and review, see Jasra et al. (2007a) and Iba (2001).



Both MC$^3$ and ST suffer from inefficiency because they discard all samples from $\pi_k$ for
$k \neq 1$. The discarded samples could be used to estimate expectations under $\pi$ if they
were given appropriate importance sampling (IS) weights. For an inclusive review of IS and
related methods see Liu (2001, Chapter 2). Moreover, it may be the case that an IS estimator
constructed with samples from a tempered distribution has smaller variance than one based
on a sample of the same size from $\pi$. As a simple motivating example, let
$\pi(\theta) = N(\theta \mid \mu, \sigma^2)$, and consider estimating $\mu = E_\pi(\theta)$ by IS from a tempered
distribution $\pi_k(\theta) \propto \pi(\theta)^k$. A straightforward calculation shows that the value of $k$ which
minimises the variance of the IS estimator is

$$k^* = \begin{cases} \dfrac{3}{2} + \dfrac{\sigma^2}{\mu^2} - \dfrac{1}{2}\left(1 + \dfrac{8\sigma^2}{\mu^2} + \dfrac{4\sigma^4}{\mu^4}\right)^{1/2} & \text{if } \mu \neq 0, \\[1ex] 1/2 & \text{otherwise.} \end{cases} \tag{1}$$

Note that $k^* \in (1/2, 1)$ for all $\mu$ and $\sigma^2$. Moreover, one can compute (numerically)
$k^- = k^-(\sigma/\mu) < k^*$ such that for all $k \in (k^-, 1)$, the variance of the IS estimator $\hat\mu_k$ based on
samples from $\pi_k$ is smaller than that of one based on a sample of the same size from $\pi$.
However, $\mathrm{Var}(\hat\mu_k) \to \infty$ as $k \to 0$ for all $\mu$ and $\sigma^2$. Table 1 gives $k^*$ and $k^-$ for various
values of $\sigma/\mu$.





$\sigma/\mu$    1/16    1/4     1       4       16
$k^*$           1.00    0.95    0.70    0.52    0.50
$k^-$           0.99    0.89    0.42    0.18    0.16

Table 1: Values of $k^*$ and $k^-$ for various values of $\sigma/\mu$.
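
The $k^*$ row of Table 1 can be checked numerically. The following is a minimal Python sketch rather than part of the original analysis. It assumes the closed form $k^* = 3/2 + r - \frac{1}{2}(1 + 8r + 4r^2)^{1/2}$ with $r = \sigma^2/\mu^2$, and that the IS weights $w = \pi/\pi_k$ are fully normalised, so that $E_{\pi_k}\{w(\theta)^2\theta^2\}/\mu^2 = \{k(2-k)\}^{-1/2}\{1 + r/(2-k)\}$. The function names `k_star` and `second_moment_factor` are hypothetical.

```python
import math


def k_star(sigma_over_mu):
    """Variance-minimising inverse temperature for the normal example.

    Assumes the closed form quoted in the lead-in; passing infinity
    encodes the case mu = 0, for which k* = 1/2.
    """
    if math.isinf(sigma_over_mu):
        return 0.5
    r = sigma_over_mu ** 2  # r = sigma^2 / mu^2
    return 1.5 + r - 0.5 * math.sqrt(1.0 + 8.0 * r + 4.0 * r * r)


def second_moment_factor(k, r):
    """k-dependent part of the IS variance: E_{pi_k}[(w * theta)^2] / mu^2."""
    return (1.0 + r / (2.0 - k)) / math.sqrt(k * (2.0 - k))


# Print k* for the sigma/mu values used in Table 1.
for s in (1 / 16, 1 / 4, 1, 4, 16):
    print(s, round(k_star(s), 2))
```

The second function diverges as $k \to 0$, which is the numerical counterpart of the statement that the variance of the tempered IS estimator is unbounded at very hot temperatures.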
Therefore, there is a trade-off in the choice of tempered IS proposals. On the one hand,
low inverse-temperatures $k$ in ST can guard against missing modes of $\pi$ with large support
by encouraging better mixing between modes, but can yield very inefficient (IS) estimators
overall. On the other hand, "lukewarm" temperatures $k$, especially $k \in (1/2, 1)$, can yield
more efficient estimators within modes than those obtained from samples at $k = 1$.



Jennison (1993) was the first to suggest using a single tempered distribution as a proposal
in IS, and Neal (1996, 2001, 2005) has since written several papers combining IS and
tempering. Indeed, in the discussion of the 1996 paper on tempered transitions, Neal writes
"simulated tempering allows data associated with $p_i$ other than $p_0$ [the cold distribution]
to be used to calculate expectations with respect to . . . $p_0$ (using an importance sampling
estimator)"$^1$. It is this natural extension that we call importance tempering (IT), with IMC$^3$
defined similarly. Given the work of the above-mentioned authors, and the fact that
calculating importance weights is relatively trivial, it may be surprising that successful IT and
IMC$^3$ applications have yet to be published. Liu (2001) comes close in proposing to augment
ST with dynamic weighting (Wong and Liang, 1997) and in applying the Wang-Landau
algorithm (Atchadé and Liu, 2007) to ST.

This paper addresses why the straightforward methodology described above has tended
not to work well in practice, primarily due to a lack of a principled way of combining the
importance weights collected at each temperature to obtain an overall estimator. If we are
interested in estimating $E_\pi\{h(\theta)\}$, one way to do this is with

$$\hat h = W^{-1} \sum_{t=1}^{T} w(\theta^{(t)}, k^{(t)})\, h(\theta^{(t)}), \quad \text{where} \quad W = \sum_{t=1}^{T} w(\theta^{(t)}, k^{(t)}), \tag{2}$$

and $w(\theta, k) = \pi(\theta)/\pi(\theta)^k = \pi(\theta)^{1-k}$. Observe that this estimator is of the form
$\hat h = \sum_{i=1}^{m} \lambda_i \hat h_i$, where $0 \le \lambda_i \le \sum_{i=1}^{m} \lambda_i = 1$, with
$\lambda_i = W^{-1} \sum_{t=1}^{T} w(\theta^{(t)}, k^{(t)})\, \mathbb{1}_{\{k^{(t)} = k_i\}}$, and
where each $\hat h_i$ is an IS estimator of $E_\pi\{h(\theta)\}$ constructed using only the observations at the
inverse-temperature $k_i$. We show how to improve this estimator by choosing $\lambda_1, \ldots, \lambda_m$ to
maximise the effective sample size (see next paragraph), which approximately corresponds
to minimising the variance of $\hat h$ (Liu, 2001, Section 2.5.3). For the applications that we
have in mind, it is important that our estimator can be constructed without knowledge of
the normalising constants of $\pi_{k_1}, \ldots, \pi_{k_m}$. It is for this reason that methods motivated by
the balance heuristic (Veach and Guibas, 1995; Owen and Zhou, 2000; Madras and Piccioni,
1999) cannot be applied.

$^1$ A similar note is made in the 2001 paper with regard to annealed importance sampling.

The notion of effective sample size plays an important role in the study of IS estimators.
Suppose we are interested in estimating $E_\pi\{h(\theta)\}$ using a vector of observations
$\theta = (\theta^{(1)}, \ldots, \theta^{(T)})$ from a density $\pi'$. Define the vector of importance weights
$w = w(\theta) = (w(\theta^{(1)}), \ldots, w(\theta^{(T)}))$, where $w(\theta) = \pi(\theta)/\pi'(\theta)$. Following
Liu (2001, Section 2.5.3) we define the effective sample size by

$$\mathrm{ESS}(w(\theta)) \equiv \mathrm{ESS}(w) = \frac{T}{1 + \mathrm{cv}^2(w)}, \tag{3}$$

where $\mathrm{cv}^2(w)$ is the squared coefficient of variation of the weights, given by

$$\mathrm{cv}^2(w) = \frac{\sum_{t=1}^{T} (w(\theta^{(t)}) - \bar w)^2}{(T - 1)\, \bar w^2}, \quad \text{where} \quad \bar w = T^{-1} \sum_{t=1}^{T} w(\theta^{(t)}).$$

This should not be confused with the concept of effective sample size due to autocorrelation
(Kass et al., 1998) (due to serially correlated samples from a Markov chain). This latter
notion is discussed briefly in Section 4.
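
As a concrete reading of the definition in Eq. (3), the sketch below computes the ESS from a list of importance weights using $T/\{1 + \mathrm{cv}^2(w)\}$ with the $(T-1)$-denominator sample variance. It is an illustrative Python fragment, and the function name `ess` is hypothetical.

```python
def ess(weights):
    """Effective sample size T / (1 + cv^2(w)) of a list of IS weights.

    cv^2 is the sample variance of the weights (denominator T - 1)
    divided by the squared sample mean, so at least two weights are needed.
    """
    T = len(weights)
    if T < 2:
        raise ValueError("need at least two weights")
    w_bar = sum(weights) / T
    cv2 = sum((w - w_bar) ** 2 for w in weights) / ((T - 1) * w_bar ** 2)
    return T / (1.0 + cv2)


print(ess([1.0, 1.0, 1.0, 1.0]))  # equal weights: ESS equals T
print(ess([1.0, 0.0, 0.0, 0.0]))  # one dominant weight: ESS falls below 1
```

Because both numerator and denominator of $\mathrm{cv}^2$ scale with the square of the weights, the result is unchanged when all weights are multiplied by a constant. This is why weights known only up to a normalising constant suffice.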

Observe that the swap operations in MC$^3$ require that the state space $\Theta$ be common
for all $m$ tempered distributions. This is not a requirement for ST, as the state stays fixed
when changes in temperature are proposed. Thus applying MC$^3$ is less straightforward
in (Bayesian) model selection/averaging problems which typically involve trans-dimensional
Markov chains as in reversible-jump MCMC (RJMCMC) (Green, 1995), though it is possible
(Jasra et al., 2007b). Since RJMCMC algorithms are particularly prone to slow mixing, and
hence are an excellent source of applications of our idea (as illustrated in Section 3), the rest
of the paper will focus on IT. Most of our results apply equally to IMC$^3$ by ignoring the
pseudo-prior.

The outline of the paper is as follows. In Section 2 we derive the optimal convex
combination of multiple IS estimators, and show how this estimator has a particularly attractive
property with regard to its effective sample size. In Section 3 we briefly report on the
effectiveness of optimal IT, and the poor performance of the naive approach, on several real and
synthetic examples. Section 4 concludes with a discussion.



2 Importance tempering 



The simulated tempering (ST) (Geyer and Thompson, 1995) algorithm is an application of
MH on the product space of parameters and inverse-temperatures. That is, samples are
obtained from the joint chain $\pi(\theta, k) \propto \pi(\theta)^k p(k)$. This is only possible if $\pi(\theta)^k$ is integrable,
but Hölder's inequality may be used to show that this is indeed the case provided that
$E_\pi\{\|\theta\|^{d(1-k)/k + \delta}\} < \infty$ for some $\delta > 0$, where $d$ is the dimension of $\theta$ and $\|\cdot\|$
denotes the Euclidean norm. The success
of ST depends crucially on the ability of the Markov chain frequently to: (a) visit high
temperatures (low $k$) where the probability of escaping local modes is high; (b) visit $k = 1$
to obtain samples from $\pi$. The algorithm can be tuned by: (i.) adjusting the number
and location of the rungs of the temperature ladder; or (ii.) adjusting the pseudo-prior
$p(k)$. Geyer and Thompson (1995) give some automated ways of adjusting the spacing of
the rungs of the ladder. Iba (2001) reviews similar techniques from the physics literature.
A recent and very promising alternative approach involves the Wang-Landau algorithm
(Atchadé and Liu, 2007). However, many authors prefer to rely on defaults, e.g.,

$$k_i = \begin{cases} (1 + \Delta_k)^{1-i} & \text{geometric spacing} \\ \{1 + \Delta_k (i - 1)\}^{-1} & \text{harmonic spacing} \end{cases} \qquad i = 1, \ldots, m. \tag{4}$$

The rate parameter $\Delta_k > 0$ can be problem specific. Motivation for such default spacings is
outlined by Liu (2001, Chapter 10: pp. 213 & 233). Geometric spacing, or uniform spacing
of $\log(k_i)$, is also advocated by Neal (1996, 2001).
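
The two default ladders are straightforward to generate. The sketch below is illustrative only and assumes the geometric rule $k_i = (1 + \Delta_k)^{1-i}$ and the harmonic rule $k_i = \{1 + \Delta_k(i-1)\}^{-1}$ for $i = 1, \ldots, m$. The helper `geometric_rate`, which solves for the $\Delta_k$ that places the hottest rung at a chosen $k_m$, is a hypothetical convenience and not something prescribed by the text.

```python
def geometric_ladder(m, delta_k):
    """Inverse temperatures k_i = (1 + delta_k)^(1 - i), i = 1..m."""
    return [(1.0 + delta_k) ** (1 - i) for i in range(1, m + 1)]


def harmonic_ladder(m, delta_k):
    """Inverse temperatures k_i = 1 / (1 + delta_k * (i - 1)), i = 1..m."""
    return [1.0 / (1.0 + delta_k * (i - 1)) for i in range(1, m + 1)]


def geometric_rate(m, k_m):
    """Rate delta_k for which the geometric ladder ends exactly at k_m."""
    return k_m ** (-1.0 / (m - 1)) - 1.0


# Example: a 40-rung geometric ladder running from k_1 = 1 down to k_40 = 0.1.
ladder = geometric_ladder(40, geometric_rate(40, 0.1))
print(ladder[0], ladder[-1])
```

Both rules start at the cold distribution, $k_1 = 1$. The geometric ladder places rungs uniformly in $\log k$, whereas the harmonic ladder places them uniformly in temperature $1/k$.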



Once a suitable ladder has been chosen, the goal is typically to choose the pseudo-prior
so that the posterior over temperatures is uniform. The best way to accomplish this is
to set $p(k_i) = 1/Z_i$, where $Z_i = \int \pi(\theta)^{k_i}\, d\theta$ is the normalising constant in $\pi_{k_i} = \pi^{k_i}/Z_i$,
which is generally unknown. So while normalising constants are not a prerequisite for ST,
it can certainly be useful to know them. We follow the suggestions of Geyer and Thompson
(1995) in setting the pseudo-prior by a method that roughly approximates the $Z_i$ in two
stages: first by stochastic approximation (Kushner and Lin, 1997), and then by observation
counts accumulated through pilot runs. To some extent, a non-uniform posterior on the
temperatures is less troublesome in the context of IT than ST. So long as the chain still
visits the heated temperatures often enough to get good mixing in $\Theta$, and if the ESS of the
IS estimators at some temperature(s) is not too low, useful samples can be obtained without
ever visiting the cold distribution.
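
To make the mechanics of ST concrete, the following Python sketch alternates a random-walk Metropolis update of $\theta$ at the current rung with a proposal to move to an adjacent rung. It is one possible arrangement under stated assumptions and not the tuned sampler used for the results reported later: the target is the two-component normal mixture used as a toy example in Section 3.1, the pseudo-prior is left uniform instead of being estimated from pilot runs, the proposal standard deviation is scaled by $k^{-1/2}$, and all function and argument names are hypothetical.

```python
import math
import random


def log_pi(theta):
    """Unnormalised log density of 0.6 N(-8, 0.5^2) + 0.4 N(8, 0.9^2)."""
    la = math.log(0.6 / 0.5) - 0.5 * ((theta + 8.0) / 0.5) ** 2
    lb = math.log(0.4 / 0.9) - 0.5 * ((theta - 8.0) / 0.9) ** 2
    hi = max(la, lb)  # log-sum-exp guards against underflow far from the modes
    return hi + math.log(math.exp(la - hi) + math.exp(lb - hi))


def simulated_tempering(n_iter, ladder, log_pseudo_prior, step=1.0, seed=0):
    """Return a list of (theta, rung index) pairs from a simple ST chain."""
    rng = random.Random(seed)
    theta, i = -8.0, 0  # start in one mode at the cold rung (ladder[0] = 1)
    draws = []
    for _ in range(n_iter):
        k = ladder[i]
        # Random-walk Metropolis update targeting pi(theta)^k.
        proposal = theta + rng.gauss(0.0, step / math.sqrt(k))
        if math.log(1.0 - rng.random()) < k * (log_pi(proposal) - log_pi(theta)):
            theta = proposal
        # Propose an adjacent rung; proposals off the ladder are rejected.
        j = i + rng.choice((-1, 1))
        if 0 <= j < len(ladder):
            log_ratio = ((ladder[j] - k) * log_pi(theta)
                         + log_pseudo_prior[j] - log_pseudo_prior[i])
            if math.log(1.0 - rng.random()) < log_ratio:
                i = j
        draws.append((theta, i))
    return draws


ladder = [0.1 ** (r / 9.0) for r in range(10)]  # geometric, from 1 down to 0.1
draws = simulated_tempering(5000, ladder, [0.0] * len(ladder))
cold = [theta for theta, i in draws if i == 0]  # what plain ST would retain
print(len(draws), len(cold))
```

Plain ST keeps only the `cold` draws. IT would instead attach the weight $\pi(\theta)^{1-k}$ to every draw, which in this sketch is `math.exp((1 - ladder[i]) * log_pi(theta))` up to a constant.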



2.1 A new optimal way to combine IS estimators 

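ST provides us with $\{(\theta^{(t)}, k^{(t)}) : t = 1, \ldots, T\}$, where $\theta^{(t)}$ is a sample from $\pi_{k^{(t)}}$. Write
$\mathcal{T}_i = \{t : k^{(t)} = k_i\}$ for the index set of observations at the $i$th temperature, and let $T_i = |\mathcal{T}_i|$.
Let the vector of observations at the $i$th temperature collect in $\theta_i = (\theta_{i1}, \ldots, \theta_{iT_i})$, so that
$\{\theta_{ij}\}_{j=1}^{T_i} \sim \pi_{k_i}$. Similarly, the vector of IS weights at the $i$th temperature is
$w_i = w_i(\theta_i) = (w_i(\theta_{i1}), \ldots, w_i(\theta_{iT_i}))$, where $w_i(\theta) = \pi(\theta)/\pi_{k_i}(\theta)$.

Each vector $\theta_i$ can be used to construct an IS estimator of $E_\pi\{h(\theta)\}$ by setting

$$\hat h_i = \frac{\sum_{j=1}^{T_i} w_i(\theta_{ij})\, h(\theta_{ij})}{\sum_{j=1}^{T_i} w_i(\theta_{ij})} \equiv \frac{\sum_{j=1}^{T_i} w_{ij}\, h(\theta_{ij})}{W_i}.$$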

It is natural to consider an overall estimator of $E_\pi\{h(\theta)\}$ defined by a convex combination:

$$\hat h_\lambda = \sum_{i=1}^{m} \lambda_i \hat h_i, \quad \text{where} \quad 0 \le \lambda_i \le \sum_{i=1}^{m} \lambda_i = 1. \tag{5}$$

Unfortunately, if $\lambda_1, \ldots, \lambda_m$ are not chosen carefully, $\mathrm{Var}(\hat h_\lambda)$ can be nearly as large as
the largest $\mathrm{Var}(\hat h_i)$ (Owen and Zhou, 2000). Notice that ST is recovered as a special case
when $\lambda_1 = 1$ and $\lambda_2 = \cdots = \lambda_m = 0$. It may be tempting to choose $\lambda_i = W_i/W$, where
$W = \sum_{i=1}^{m} W_i$, recovering the estimator in Eq. (2). This can lead to a very poor estimator,
even compared to ST, which is demonstrated empirically in Section 3.
Observe that we can write

$$\hat h_\lambda = \sum_{i=1}^{m} \sum_{j=1}^{T_i} w^\lambda_{ij}\, h(\theta_{ij}), \tag{6}$$

where $w^\lambda_{ij} = \lambda_i w_{ij}/W_i$. Let $w^\lambda = (w^\lambda_{11}, \ldots, w^\lambda_{1T_1}, w^\lambda_{21}, \ldots, w^\lambda_{mT_m})$. Attempting
to choose $\lambda_1, \ldots, \lambda_m$ to minimise $\mathrm{Var}(\hat h_\lambda)$ directly can be difficult. In the balance
heuristic, Veach and Guibas (1995) explore combinations of IS estimators of the form (6)
where $w_i(\theta) = \pi(\theta)/g_i(\theta)$ for a family of proposal densities $g_i$, with

$$\lambda_i(\theta) = \frac{c_i\, g_i(\theta)}{\sum_{r=1}^{m} c_r\, g_r(\theta)}, \tag{7}$$

and where $0 \le c_i \le \sum_{i=1}^{m} c_i = 1$ is the proportion of samples taken from $g_i$. It turns out
that this is equivalent to IS with the mixture proposal $\tilde\pi(\theta) = \sum_{r=1}^{m} c_r g_r(\theta)$:

$$\hat h = \frac{1}{T} \sum_{t=1}^{T} w(\theta_t)\, h(\theta_t), \quad \text{where} \quad w(\theta) = \frac{\pi(\theta)}{\sum_{r=1}^{m} c_r\, g_r(\theta)}. \tag{8}$$

The balance heuristic has since been generalised by Owen and Zhou (2000); it was reinvented
by Madras and Piccioni (1999, Section 4) in the context of applied probability.
Note that due to the denominator in the definition of $w(\theta)$ in Eq. (8), the $g_i$ must be
normalised densities. This precludes us from using the balance heuristic with $g_i \propto \pi^{k_i}$. When
MCMC is necessary to sample from $\pi$, the normalisation constant of $\pi$, and therefore of $\pi_{k_i}$,
is generally unknown. The method also requires evaluations of $\pi_{k_i}(\theta^{(t)})$, $i = 1, \ldots, m$, at all
$T$ rounds, an $O(mT)$ operation that trivialises any computational advantage ST has over
MC$^3$. Instead, we consider maximising the ESS of $\hat h_\lambda$ in (5).



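Proposition 2.1. Among estimators of the form (5), $\mathrm{ESS}(w^\lambda)$ is maximised by $\lambda = \lambda^*$
where, for $i = 1, \ldots, m$,

$$\lambda_i^* = \frac{\ell_i}{\sum_{i=1}^{m} \ell_i}, \quad \text{and} \quad \ell_i = \frac{W_i^2}{\sum_{j=1}^{T_i} w_{ij}^2}.$$

Proof. Since $\sum_{i=1}^{m} \sum_{j=1}^{T_i} w^\lambda_{ij} = 1$, the problem of maximising the effective sample size is the
same as

$$\min_{\lambda_1, \ldots, \lambda_m} \sum_{i=1}^{m} \sum_{j=1}^{T_i} \left( \lambda_i \frac{w_{ij}}{W_i} - \frac{1}{T} \right)^2 \quad \text{subject to} \quad 0 \le \lambda_i \le \sum_{i=1}^{m} \lambda_i = 1.$$

The result then follows by a straightforward Lagrange multiplier argument. □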
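
The optimal combination is cheap to compute once the weights at each temperature are available. The sketch below is an illustrative Python fragment and not the authors' implementation. It assumes $\ell_i = W_i^2/\sum_j w_{ij}^2$ and $\lambda_i^* = \ell_i/\sum_i \ell_i$, builds the combined weight vector $w^\lambda$ with entries $\lambda_i w_{ij}/W_i$, and compares its ESS with that of the naive choice $\lambda_i = W_i/W$. The function names and the small weight lists are hypothetical.

```python
def ess(weights):
    """Effective sample size T / (1 + cv^2(w)) of a list of weights."""
    T = len(weights)
    w_bar = sum(weights) / T
    cv2 = sum((w - w_bar) ** 2 for w in weights) / ((T - 1) * w_bar ** 2)
    return T / (1.0 + cv2)


def optimal_lambdas(weights_by_temp):
    """Return (lambda*, ell), where ell_i = W_i^2 / sum_j w_ij^2."""
    ells = []
    for w in weights_by_temp:
        W = sum(w)
        ells.append(W * W / sum(x * x for x in w))
    total = sum(ells)
    return [ell / total for ell in ells], ells


def combined_weights(weights_by_temp, lambdas):
    """Flatten the weights lambda_i * w_ij / W_i over all temperatures."""
    out = []
    for lam, w in zip(lambdas, weights_by_temp):
        W = sum(w)
        out.extend(lam * x / W for x in w)
    return out


# Hypothetical weights at three temperatures: flat, very uneven, mildly uneven.
w = [[1.0, 1.0, 1.0, 1.0], [0.1, 5.0, 0.2, 0.1], [0.5, 0.7, 0.6]]
lam_opt, ells = optimal_lambdas(w)
W_all = sum(sum(x) for x in w)
lam_naive = [sum(x) / W_all for x in w]

print(ess(combined_weights(w, lam_opt)), ess(combined_weights(w, lam_naive)))
```

In this toy input the second temperature carries one dominant weight. The naive rule rewards it for having a large $W_i$, whereas the optimal rule down-weights it because its $\ell_i$ is small.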



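In the following discussion and in Remark 2.2 below, we assume that for $i = 1, \ldots, m$,
$T_i \ge 2$. The efficiency of each IS estimator $\hat h_i$ can be measured through $\mathrm{ESS}(w_i)$. Intuitively,
we hope that with a good choice of $\lambda$, the ESS of $\hat h_\lambda$, given by

$$\mathrm{ESS}(w^\lambda) = \frac{T(T - 1)}{T^2 \sum_{i=1}^{m} \lambda_i^2/\ell_i - 1},$$

would be close to the sum over $i$ of the effective sample sizes of $\hat h_i$, namely

$$\mathrm{ESS}(w_i) = \frac{T_i (T_i - 1)\, \ell_i}{T_i^2 - \ell_i}. \tag{9}$$

The remark below shows that this is indeed the case for $\hat h_{\lambda^*}$.

Remark 2.2. We have

$$\mathrm{ESS}(w^{\lambda^*}) \ge \sum_{i=1}^{m} \mathrm{ESS}(w_i) - \frac{1}{4} - \frac{1}{T}.$$

Proof. Since $\mathrm{ESS}(w_i) \le T_i$, it follows from (9) that $\ell_i \le T_i$. Thus

$$\sum_{i=1}^{m} \frac{T_i (T_i - 1)\, \ell_i}{T_i^2 - \ell_i} \le \sum_{i=1}^{m} \ell_i \le \frac{T(T - 1) \sum_{i=1}^{m} \ell_i}{T^2 - \sum_{i=1}^{m} \ell_i} + \frac{1}{4} + \frac{1}{T},$$

since $x(1 - x)$ attains its maximum of $1/4$ at $x = 1/2$ and $\sum_i \ell_i \le \sum_i T_i = T$. □

In practice we have found that this bound is slightly conservative and that often it is
the case that $\mathrm{ESS}(w^{\lambda^*}) \ge \sum_{i=1}^{m} \mathrm{ESS}(w_i)$. Thus our optimally-combined IS estimator has a
highly desirable and intuitive property in terms of its effective sample size.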
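
The bound in Remark 2.2 can be checked numerically on simulated weights. The sketch below is illustrative only. It assumes the closed forms $\mathrm{ESS}(w_i) = T_i(T_i - 1)\ell_i/(T_i^2 - \ell_i)$ and $\mathrm{ESS}(w^{\lambda^*}) = T(T-1)L/(T^2 - L)$ with $L = \sum_i \ell_i$, and the three groups of weights are hypothetical draws chosen to range from perfectly even to heavy tailed.

```python
import random


def ell(w):
    """ell = W^2 / sum_j w_j^2 for one temperature's weights."""
    W = sum(w)
    return W * W / sum(x * x for x in w)


def ess_single(w):
    """ESS of one IS estimator, written in terms of ell as in Eq. (9)."""
    T = len(w)
    return T * (T - 1) * ell(w) / (T * T - ell(w))


def ess_optimal(weights_by_temp):
    """ESS of the optimally combined estimator, using sum_i lambda_i^2/ell_i = 1/L."""
    T = sum(len(w) for w in weights_by_temp)
    L = sum(ell(w) for w in weights_by_temp)
    return T * (T - 1) * L / (T * T - L)


rng = random.Random(42)
# Exponent 0 gives equal weights; larger exponents give more uneven weights.
groups = [[rng.expovariate(1.0) ** p for _ in range(n)]
          for p, n in ((0.0, 50), (0.5, 40), (2.0, 30))]

T = sum(len(g) for g in groups)
sum_individual = sum(ess_single(g) for g in groups)
combined = ess_optimal(groups)
print(combined, sum_individual, sum_individual - 0.25 - 1.0 / T)
```

The first printed value should never fall below the third, which is the lower bound of the remark. How it compares with the plain sum depends on the weights.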
3 Empirical Results



Here we briefly report on the success of optimal IT, relative to the naive approach and ST, 
on one simple example and two involving RJMCMC. 



3.1 A simple mixture of normals 

Consider the following toy density $\pi$, a mixture of two normals:

$$\pi(\theta) = 0.6\, N(\theta \mid \mu_1 = -8, \sigma_1^2 = 0.5^2) + 0.4\, N(\theta \mid \mu_2 = 8, \sigma_2^2 = 0.9^2). \tag{10}$$



Table 2 summarises Kolmogorov-Smirnov distances obtained under three IT estimators: ST
($\lambda_1 = 1$), naive IT ($\lambda_i = W_i/W$) and the optimally-combined IT estimator ($\hat h_{\lambda^*}$). Observe
that the optimally-combined IT estimator has both the largest ESS and the smallest variance
of the three estimators, and that $\mathrm{ESS}(w^{\lambda^*}) \ge \sum_i \mathrm{ESS}(w_i)$. Naive IT improves upon ST in
this example, but has higher variance than $\hat h_{\lambda^*}$.

Method                        $\mathrm{ESS}(w^\lambda)$    K-S distance (mean)    K-S distance (var)
ST                            2535          0.0938                 $8.5 \times 10^{-4}$
naive IT                      17779         0.0849                 $1.4 \times 10^{-4}$
$\hat h_{\lambda^*}$          22913         0.0836                 $5.2 \times 10^{-5}$
$\sum_i \mathrm{ESS}(w_i)$    22910

Table 2: Summary of K-S distances to the true mixture of normals (10) for ST ($\lambda_1 = 1$),
naive IT ($\lambda_i = W_i/W$), the optimally-combined IT estimator ($\hat h_{\lambda^*}$). We used 100 repeated
samples of size 10^, with tempered RWM proposals.



3.2 Bayesian treed Gaussian process models 



Bayesian treed models extend classification and regression tree (CART) models (Breiman et al.,
1984) by putting a prior on the tree structure. We focus on the implementation of ? who
fit Gaussian Process (GP) models at the leaves of the tree, specify the tree prior through a
process that limits its depth, and then define the tree operations grow, prune, change, and
swap, to allow inference to proceed by RJMCMC. The RJMCMC chain usually identifies the
correct maximum a posteriori (MAP) tree, but consistently and significantly overestimates
the posterior probability of deep trees.

To guard against the transdimensional chain getting stuck in local modes of the posterior,
? resorted to regularly restarting the chain from the null tree. ST provides an alternative by
increasing the rate of accepted tree operations in higher temperatures. In particular, we find
that ST can increase the rate of accepted prune operations by an order of magnitude, thus
enabling the chain to escape the local modes of deep trees. To demonstrate IT we fit a treed
GP model with ST using a geometric ladder with $m = 40$ and $k_m = 0.1$ to two datasets first
explored by ?: the 1-d motorcycle accident data and 2-d exponential data. We refer to that
paper for details about the data and models.

For the motorcycle accident data the ST chain was run for $T = 1.5 \times 10^5$ iterations, where
a total of $T_1 = 3732$ ($\approx T/m = 3750$) samples were obtained from the cold distribution. That
$\mathrm{ESS}(w^{\lambda^*}) = 9338 \approx 2.5\,T_1$ shows the considerable improvement of IT over ST. Moreover,
we have $\mathrm{ESS}(w^{\lambda^*}) \ge \sum_i \mathrm{ESS}(w_i) = 9334$. The naive combination $\lambda_i = W_i/W$ in (2) yields
$\mathrm{ESS}(w^\lambda) = 285 < \frac{1}{10} T_1$, undermining the very motivation of IT. For the exponential data the
ST chain was run for a total of $T = 5 \times 10^5$ iterations. A total of $T_1 = 12436$ ($\approx T/m = 12500$)
samples were obtained from the cold distribution. We found that $\mathrm{ESS}(w^{\lambda^*}) = 21778 \approx
1.75\,T_1$, illustrating how IT improves on ST. Moreover, we have $\mathrm{ESS}(w^{\lambda^*}) \ge \sum_i \mathrm{ESS}(w_i) =
21776$. The naive combination $\lambda_i = W_i/W$ in (2) yields $\mathrm{ESS}(w^\lambda) = 654$, which is worse than
ST.
3.3 Mark-Recapture-Recovery Data



We now consider a Bayesian model selection problem with data relating to the
mark-recapture and recovery of shags on the Isle of May (King and Brooks, 2002). The three
demographic parameters of interest are: survival rates, recapture rates and recovery rates. The
models considered for each of the demographic parameters allowed a possible age- and/or
time-dependence, where the time dependence was conditional on the age structure of the
parameters. Typically, movement between the different possible models (by adding/removing
time dependence for a given age group, or updating the age structure of the parameters) is
slow, with small acceptance probabilities. For further details of the data, model structure,
and RJMCMC algorithm see King and Brooks (2002).



Using the same ST setup as above, we ran $T = 10^7$ iterations and discarded the first 10%
as burn-in. As with the treed examples, higher temperatures yielded higher acceptance rates
and an order of magnitude better exploration of model space compared to (untempered)
RJMCMC. A total of $T_1 = 248158$ ($\approx T/m = 225000$) realisations were obtained from the
cold distribution. By comparison, for optimal IT we have $\mathrm{ESS}(w^{\lambda^*}) = 612026 \approx 2.5\,T_1$ and
$\mathrm{ESS}(w^{\lambda^*}) \ge \sum_i \mathrm{ESS}(w_i) = 612020$. The corresponding naive IT approach (using $\lambda_i = W_i/W$)
performed exceptionally poorly, with $\mathrm{ESS}(w^\lambda)$ of only 5.43, due to a few large weights
obtained at hot temperatures.



4 Discussion 

This paper has addressed the inefficiencies and wastefulness of simulated tempering (ST),
and related algorithms that are designed to improve mixing in the Markov chain using
tempered distributions. We argued that importance sampling (IS) from tempered distributions
can produce estimators that are more efficient than ones based on independent sampling,
provided that the temperature is chosen carefully. This motivated augmenting the ST
algorithm by calculating importance weights to salvage discarded samples, a technique which
we have called importance tempering (IT). This idea has been suggested before, but to our
knowledge little exploration has been carried out for real, complex applications. We have
derived optimal combination weights for the resulting collection of IS estimators, which can
be calculated even when the normalisation constants of the tempered distributions are
unknown. The weights are essentially proportional to the effective sample size (ESS) of the
individual estimators, and we found that the resulting combined ESS in this case would be
approximately equal to their sum.

We note that the overall success of the optimal IT estimator depends crucially on a
successful implementation of ST, i.e., having a good temperature ladder and pseudo-prior.
However, it is also important to recognise that the optimal combination, as a
resource-efficient post-processing step, is equally applicable in other contexts, i.e., within MC$^3$, or
even outside of the domain of tempered MCMC to combine any collection of IS estimators.
Sequential Monte Carlo samplers (Del Moral et al., 2006) may facilitate a natural extension.



We have illustrated IT on several examples which benefit from the improved mixing ST 
provides. For example, the optimal IT methodology can increase the resulting ESS compared 
to retaining samples only from the cold distribution by roughly a factor of two. 

Since IT involves sampling from a Markov chain, ideally one would take into account
the serial correlation in the objective criteria for combining the individual estimators. The
effective sample size due to autocorrelation is defined (Kass et al., 1998) by

$$\mathrm{ESS}_\rho(\theta) = \frac{T}{1 + 2 \sum_{\ell=1}^{T-1} \rho(\ell, \theta)}, \tag{11}$$

where $\rho(\ell, \theta)$ is the sample autocorrelation in $\theta$ at lag $\ell$; thus for scalar $\theta$ we have that
$\rho(\ell, \theta) = \gamma(\ell, \theta)/\gamma(0, \theta)$, where $\gamma(\ell, \theta) = (T - \ell)^{-1} \sum_{t=1}^{T-\ell} (\theta^{(t)} - \bar\theta)(\theta^{(t+\ell)} - \bar\theta)$, and
$\bar\theta = T^{-1} \sum_{t=1}^{T} \theta^{(t)}$. The results from the previous section suggest that, when the temperature
ladder is fixed, a sensible heuristic might be to consider combining the individual estimators
with weights $\lambda_i$ proportional to the product of $T_i^{-1}\mathrm{ESS}_\rho(\theta_i)$ and $\mathrm{ESS}(w_i)$, say. However,
when considering modifications to the number ($m$) and spacing of inverse temperatures
$k = \{k_1, \ldots, k_m\}$, there is clearly a conflict of interest between the two measures of effective
sample size. Adding more inverse temperatures near one may increase $\mathrm{ESS}(w^{\lambda^*})$, but may
also increase autocorrelation in the marginal chain for $k$. Therefore it may be sensible to
factor $\mathrm{ESS}_\rho(k)$ into the objective as well. Searching for temperature ladders that maximise
a hybrid of ESS and $\mathrm{ESS}_\rho$ would represent a natural extension of this work.
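
For a scalar chain, the autocorrelation-based effective sample size can be computed directly from the sample autocovariances. The sketch below is a minimal illustration and assumes the form $\mathrm{ESS}_\rho = T/\{1 + 2\sum_\ell \rho(\ell, \theta)\}$ with $\gamma(\ell, \theta)$ normalised by $T - \ell$. The optional `max_lag` truncation is an added assumption, since sample autocorrelations at lags close to $T$ are very noisy, and the function names are hypothetical.

```python
def autocov(x, lag):
    """Sample autocovariance gamma(lag) with denominator T - lag."""
    T = len(x)
    mean = sum(x) / T
    return sum((x[t] - mean) * (x[t + lag] - mean) for t in range(T - lag)) / (T - lag)


def ess_rho(x, max_lag=None):
    """T / (1 + 2 * sum of sample autocorrelations up to max_lag).

    max_lag defaults to T - 1; a constant chain has gamma(0) = 0 and is
    not handled here.
    """
    T = len(x)
    if max_lag is None:
        max_lag = T - 1
    g0 = autocov(x, 0)
    rho_sum = sum(autocov(x, lag) / g0 for lag in range(1, max_lag + 1))
    return T / (1.0 + 2.0 * rho_sum)


# A chain that sits at 0 for 50 steps and then at 1 for 50 steps is highly
# autocorrelated, so even a lag-1 correction shrinks the sample size a lot.
chain = [0.0] * 50 + [1.0] * 50
print(ess_rho(chain, max_lag=1))
```

Applied to the sub-chain retained at each temperature, this quantity could be multiplied by the importance-weight ESS to form the hybrid weights suggested above.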